Haplotype-resolved chromosomal-level genome assembly of Buzhaye (Microcos paniculata)

Microcos paniculata is a shrub used traditionally as folk medicine and to make herbal teas. Previous research into this species has mainly focused on its chemical composition and medicinal value. However, the lack of a reference genome limits the study of the molecular mechanisms of active compounds in this species. Here, we assembled a haplotype-resolved chromosome-level genome of M. paniculata based on PacBio HiFi and Hi-C data. The assembly contains two haploid genomes with sizes 399.43 Mb and 393.10 Mb, with contig N50 lengths of 43.44 Mb and 30.17 Mb, respectively. About 99.93% of the assembled sequences could be anchored to 18 pseudo-chromosomes. Additionally, a total of 482 Mb repeat sequences were identified, accounting for 60.76% of the genome. A total of 49,439 protein-coding genes were identified, of which 48,979 (99%) were functionally annotated. This haplotype-resolved chromosome-level assembly and annotation of M. paniculata will serve as a valuable resource for investigating the biosynthesis and genetic basis of active compounds in this species, as well as advancing evolutionary phylogenomic studies in Malvales.


Background & Summary
Microcos paniculata Linnaeus (Fig. 1a), known in Chinese as Buzhaye, is a shrub commonly used in traditional Chinese medicine and herbal cooling teas 1, including Wanglaoji, Huoqizheng 2 and Jiaduobao, with an annual demand of about 250 tons (http://bk.cnpharm.com/zgyyb/2008/04/28/246974.html). The leaves of M. paniculata are also commonly used in ethnomedicinal treatments for food stagnation, damp-heat jaundice and fever 3. To date, numerous studies have investigated the phytochemical composition and pharmacological properties of this species, revealing bioactive secondary metabolites such as flavonoids, alkaloids, triterpenoids and organic acids 1,4 in M. paniculata extracts. However, owing to the lack of a high-quality reference genome, the molecular basis and evolution of secondary metabolite biosynthesis in M. paniculata have rarely been reported 5.
In the present study, we assembled the genome of M. paniculata using 106× short reads (42 Gb), 35× HiFi reads (14 Gb), 75× Hi-C reads (30 Gb) and 50× iso-seq reads (20 Gb). The final assembly (~792 Mb) consisted of two complete haplotypes, haplotype A (399.43 Mb) and haplotype B (393.10 Mb), with contig N50 lengths of 43.44 Mb and 30.17 Mb, respectively (Table 1). About 99.93% of the assembled sequences were anchored onto 18 (2n) pseudo-chromosomes (Fig. 1b). The chloroplast and mitochondrial genomes were 159,456 bp and 380,905 bp, respectively. A total of 1,080,648 repeat sequences, with a combined length of approximately 482 Mb, were identified, accounting for 60.76% of the assembled genome. Of the identified repeats, long terminal repeats (LTRs) constituted the largest proportion, with 394,112 elements and a cumulative length of 321,160,287 bp, accounting for 40.52% of the M. paniculata genome assembly (Table 2). The genome contained 65,874 genes, including 49,439 protein-coding genes and 16,435 non-coding genes (Table 3). A total of 48,979 genes were functionally annotated, accounting for 99% of the identified protein-coding genes (Table 4). Of these, 44,971 genes were annotated by all three methods (Fig. 2). In particular, 639 genes were annotated as being related to the biosynthesis or metabolism of flavonoids, alkaloids and triterpenoids (Table S1). The resulting high-quality reference genome and annotation of M. paniculata will be a valuable resource for improving our understanding of the evolutionary relationships within the Malvales, for studying the molecular basis and biosynthetic mechanisms of phytochemical compounds, and for further study and exploitation of M. paniculata.

Fig. 2 Venn diagram showing the unique and shared functionally annotated protein-coding genes in M. paniculata using the three strategies.

Methods
Sample collection and genome sequencing. Samples of M. paniculata were collected at Xishuangbanna Tropical Botanical Garden (XTBG), Chinese Academy of Sciences, Mengla, Yunnan Province, China. Genomic DNA was extracted using a modified CTAB method 6. DNA quality was assessed using a NanoDrop One spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). Whole genome sequencing, PacBio sequencing, Hi-C (high-throughput chromosome conformation capture) sequencing and full-length isoform sequencing (iso-seq) were performed at Wuhan Benagen Technology Co. Ltd. (Wuhan, China). For whole genome sequencing, 1 μg of genomic DNA was sonicated to an approximate size range of 200-400 bp using a sonicator (Covaris, Brighton, UK). The short-read libraries were constructed following the manufacturer's instructions and then sequenced on the DNBSEQ-T7 platform (BGI Inc., Shenzhen, China) in PE150 (150-bp paired-end) mode.
For Hi-C sequencing, leaf material from young shoots was fixed in 2% formaldehyde solution, and the Hi-C library was generated following a published protocol 8. Briefly, the cross-linked material was digested with 400 units of MboI, labelled with biotin-14-dCTP and then subjected to blunt-end ligation of the cross-linked fragments. After re-ligation, reverse cross-linking and purification, the chromatin DNA was sheared to a size of 200-600 bp by sonication. The biotin-labelled Hi-C fragments were then enriched using streptavidin magnetic beads. After A-tailing and adapter ligation, the Hi-C libraries were PCR-amplified (12-14 cycles) and then sequenced on the DNBSEQ-T7 platform (BGI Inc., Shenzhen, China) in PE150 mode.
Full-length isoform sequencing (iso-seq) was used to obtain high-quality transcriptomic data. RNA was extracted from leaves, flowers and stems of M. paniculata using the R6827 Plant RNA Kit (Omega Bio-Tek, Norcross, GA, USA) following the manufacturer's instructions. The cDNA-PCR Sequencing kit SQK-PCS109 (Oxford Nanopore Technologies, Oxford, UK) was used to prepare full-length cDNA libraries. The libraries were then sequenced on the PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK).

Genome assembly.
PacBio HiFi reads and Hi-C short reads were combined as input to Hifiasm v0.19.5-r592 9 using the default parameters to generate haplotype-resolved contigs for subsequent analysis. Hi-C reads were mapped to the assembled haplotype contigs using Juicer v1.5.6 10, and a Hi-C-assisted initial chromosome assembly was then performed using the 3D-DNA v180922 11 pipeline (with the parameters --early-exit -m haploid -r 0). Chromosome boundaries were then adjusted, and misjoins and switch errors were corrected manually using Juicebox v1.11.08 12. This process generated chromosome-scale scaffolds and un-anchored contig sequences.
LR_Gapcloser v1.1.1 13 was used to fill gaps in the chromosome assembly based on HiFi reads (with the parameters -s p -r 2 -g 500 -v 500 -a 0.25). HiFi reads were then re-mapped to the chromosome scaffolds. The mapped reads located around the telomere repeat sequences (TTTAGGG)n 14 were then extracted and assembled into contigs using Hifiasm v0.19.5-r592 with the default parameters. The resulting contigs were aligned back to the chromosome scaffolds to extend the chromosome ends with telomere sequences, and a total of 28 telomere sequences were obtained (Fig. 3a). In addition, GetOrganelle v1.7.5 15 was used to assemble the chloroplast and mitochondrial genomes.
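As a rough illustration of how telomere arrays at chromosome ends can be recognised, the following minimal sketch counts tandem copies of the (TTTAGGG)n unit, or its reverse complement, in the terminal windows of a sequence. The function names, window size and copy-number threshold are hypothetical choices made for this example and are not taken from the pipeline described above.

```python
import re

# Plant-type telomere unit named in the text, plus its reverse complement,
# which is what the forward strand shows at the 5' end of a chromosome.
TELOMERE_UNIT = "TTTAGGG"
TELOMERE_UNIT_RC = "CCCTAAA"


def longest_tandem_run(seq, unit):
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    runs = re.findall(f"(?:{unit})+", seq.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


def telomere_ends(seq, window=10_000, min_copies=5):
    """Report whether the start and end of `seq` carry a telomere-like array.

    `window` and `min_copies` are illustrative thresholds (assumptions),
    not values used in the study.
    """
    units = (TELOMERE_UNIT, TELOMERE_UNIT_RC)
    head, tail = seq[:window], seq[-window:]
    start_copies = max(longest_tandem_run(head, u) for u in units)
    end_copies = max(longest_tandem_run(tail, u) for u in units)
    return start_copies >= min_copies, end_copies >= min_copies


# Toy sequence: a six-copy array at the start, only two copies at the end.
toy = "CCCTAAA" * 6 + "ACGT" * 50 + "TTTAGGG" * 2
print(telomere_ends(toy, window=60))
```

Applied to every pseudo-chromosome, a check of this kind yields a per-end presence/absence table comparable in spirit to the telomere distribution shown in Fig. 3a.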
Nextpolish2 v0.1.0 16 was used to polish the above assembly based on HiFi reads and short reads with default parameters. Redundant haplotigs and rDNA fragments were removed using the Redundans v0.13c 17 pipeline (with the parameters -identity 0.98 -overlap 0.8) and manually curated. A high-quality haplotype-resolved genome assembly of M. paniculata was then obtained.
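Contiguity of an assembly such as this one is commonly summarised by the contig N50. The sketch below shows the standard definition on toy contig lengths; the values are illustrative only and are not taken from the M. paniculata assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Toy contig lengths (illustrative only).
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

In practice the lengths would be read from the FASTA index of each haplotype and the statistic computed per haplotype, as reported for haplotypes A and B.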
PASA (Program to Assemble Spliced Alignments) v2.4.1 33 was used to annotate the genomic structure based on transcript evidence with the default parameters. Then, full-length gene sequences were identified by aligning with homologous protein evidence using BLAT 34 (-prot) and removing the hits with query or target coverage <95%. The gene model was trained and optimized for five rounds in AUGUSTUS v3.4.0 35 using the full-length gene set with the default parameters.
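The 95% coverage rule for retaining full-length genes can be sketched as follows. This is a minimal illustration under stated assumptions: the `Hit` record and its field names are hypothetical (they are not BLAT's output columns), and coverage is approximated here as the aligned span divided by sequence length, which may differ from how the authors computed it.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Minimal alignment record; field names are hypothetical."""
    query_len: int
    query_start: int   # 0-based, inclusive
    query_end: int     # 0-based, exclusive
    target_len: int
    target_start: int
    target_end: int


def coverage(start, end, length):
    """Fraction of a sequence spanned by the aligned interval (assumed definition)."""
    return (end - start) / length


def is_full_length(hit, min_cov=0.95):
    """Keep a hit only if both query and target coverage reach `min_cov`."""
    q_cov = coverage(hit.query_start, hit.query_end, hit.query_len)
    t_cov = coverage(hit.target_start, hit.target_end, hit.target_len)
    return q_cov >= min_cov and t_cov >= min_cov


# Toy hits: the first spans nearly all of both sequences, the second does not.
hits = [
    Hit(300, 0, 300, 310, 5, 305),
    Hit(300, 0, 200, 310, 0, 200),
]
full_length = [h for h in hits if is_full_length(h)]
print(len(full_length))
```

Requiring both coverages to pass excludes fragmentary gene models as well as models that extend far beyond the homologous protein, which suits a training set for an ab initio predictor.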
The MAKER2 v2.31.9 36 pipeline was used to perform annotation based on ab initio prediction, the transcript evidence and the homologous protein evidence. Briefly: (1) RepeatMasker v4.0.7 19 was used to mask repeat sequences in the genome; (2) AUGUSTUS v3.4.0 35 was used for ab initio prediction based on the genomic sequence; (3) BLASTN was used to align the transcript evidence to the repeat-masked genome, BLASTX was used to align the homologous protein evidence to the genome, and Exonerate v2.2.0 37 was used to realign the BLAST hits to the genome; (4) finally, the predicted gene models were integrated using MAKER2 based on the hints generated from the above alignments.
EvidenceModeler (EVM) v1.1.1 38 was further employed to merge the annotation results obtained from PASA v2.4.1 and MAKER2 v2.31.9, generating consensus annotations. TEsorter v1.4.1 39 was used to identify TE protein domains in the genome (with the parameters -genome -db rexdb -cov 30 -eval 1e-5 -prob 0.9), and these domains were masked in the EVM process. The results obtained from EVM were refined by incorporating UTR sequences and alternative splicing using PASA v2.4.1 with the default parameters. Annotations that were too short (<50 amino acids), lacked start or stop codons, contained an internal stop codon, or had ambiguous bases were excluded. All annotations were then merged, and redundant annotations were removed.
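The four exclusion criteria applied to the final gene models can be expressed as a small filter. The sketch below is illustrative only: it assumes each model is available as an in-frame, spliced coding sequence that includes its stop codon, it assumes ATG as the only start codon, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_filters(cds, min_aa=50):
    """Apply the exclusion criteria from the text to one coding sequence.

    `cds` is assumed to be the spliced, in-frame coding sequence (5'->3')
    including the terminal stop codon.
    """
    cds = cds.upper()
    if set(cds) - set("ACGT"):
        return False                      # ambiguous bases present
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if not codons or codons[0] != "ATG":
        return False                      # no start codon (ATG assumed)
    if codons[-1] not in STOP_CODONS:
        return False                      # no stop codon (or trailing partial codon)
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False                      # internal stop codon
    return len(codons) - 1 >= min_aa      # protein length excludes the stop codon


# Toy models: one valid, one too short.
valid_model = "ATG" + "GCT" * 60 + "TAA"
short_model = "ATG" + "GCT" * 10 + "TAA"
print(passes_filters(valid_model), passes_filters(short_model))
```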

Data Records
The BGI short reads, PacBio HiFi long reads, Hi-C reads and Iso-Seq data have been deposited at the Sequence Read Archive database of NCBI (National Center for Biotechnology Information) under accession numbers SRR25456891-SRR25456894 [47][48][49][50]. The final genome assembly has been deposited at the GenBank database under the accession numbers GCA_030664735.1 51 and GCA_030664755.1 52. The genome annotations are available from the Figshare repository 53. The AUGUSTUS model trained and optimized for this genome, together with the configuration files for MAKER, is available from the Figshare repository 54.

Technical Validation
We first calculated the mapping rate as a measure of assembly accuracy. The short reads and the long reads were re-mapped to the assembly using BWA-MEM v0.7.17-r1188 55 and Minimap2 v2.24 31, respectively, with the default parameters. The mapping rates were calculated after filtering out non-primary alignments. In total, 99.89% of HiFi reads, 97.75% of iso-seq reads and 99.81% of short reads were mapped (Table 5). Moreover, the read coverage depth of both short- and long-read data was evenly distributed along each phased chromosome, indicating the high quality of our haplotype-resolved assembly (Figure S1).
We evaluated the completeness of the genome assembly using BUSCO (Benchmarking Universal Single-Copy Orthologs) v5.3.2 56 based on the embryophyta_odb10 ortholog database. The BUSCO evaluation of haplotype A identified 1,591 complete BUSCOs (1,561 single-copy and 30 duplicated), accounting for 98.6% of the BUSCO set, while missing BUSCOs represented only 0.7% (Table 6). Similarly, the BUSCO assessment of haplotype B identified 1,588 complete BUSCOs (1,560 single-copy and 28 duplicated), accounting for 98.4% of the BUSCO set, while missing BUSCOs represented only 0.9% (Table 6). This indicates a relatively complete assembly.
We used Merqury v1.3 57 to estimate the consensus and completeness of the genome assembly. Our results gave a consensus quality value (QV) of 73.38 for the genome assembly, and the completeness value was 99.19% (Table 6). We also used KAT (K-mer Analysis Toolkit) v2.4.0 58 to estimate the quality of the genome assembly by comparing k-mers in the HiFi reads and in the assembly. Our results show high consistency between the reads and the genome assembly (Fig. 4a), with each haplotype representing approximately half of the heterozygous peak and nearly all of the homozygous peak (Fig. 4b,c).
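The mapping-rate calculation described at the start of this section can be illustrated with a minimal sketch that reads SAM records and uses the standard FLAG bits. Treating both secondary (0x100) and supplementary (0x800) records as non-primary is an assumption made for this example; the text does not state which records were filtered, and the function name is hypothetical.

```python
# Standard SAM FLAG bits.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_rate(sam_lines):
    """Return the fraction of primary SAM records that are mapped.

    Header lines and non-primary records (secondary or supplementary,
    an assumption of this sketch) are ignored.
    """
    primary = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        primary += 1
        if not flag & UNMAPPED:
            mapped += 1
    return mapped / primary if primary else 0.0


# Toy SAM records (illustrative only).
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r1\t256\tchr2\t500\t0\t10M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\t*",
    "r3\t16\tchr1\t900\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r3\t2048\tchr3\t50\t60\t4M6H\t*\t0\t0\tACGT\t*",
]
print(round(mapping_rate(toy_sam), 4))
```

Counting only primary records means each read contributes once to the denominator, so the rate is not inflated by reads with several alignments.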
In addition, we used BUSCO to evaluate the completeness of the genome annotation by retaining only the longest protein sequence for each gene, and found that the annotation of haplotype A was 97.6% complete, with only 17 (1.1%) genes missing, and the annotation of haplotype B was 97.1% complete, with only 19 (1.2%) genes missing (Table 7), indicating that the annotation was of high quality.
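Retaining only the longest protein per gene before running BUSCO avoids counting alternative isoforms as duplicated orthologs. A minimal sketch of that selection step is given below; the function name, the two input dictionaries and the identifier scheme are hypothetical and only illustrate the idea.

```python
def longest_isoform_per_gene(proteins, gene_of):
    """Select the longest protein for each gene.

    proteins: dict mapping transcript ID -> protein sequence
    gene_of:  dict mapping transcript ID -> gene ID
    Ties are resolved in favour of the transcript seen first.
    """
    best = {}  # gene ID -> transcript ID of the longest protein so far
    for transcript_id, seq in proteins.items():
        gene_id = gene_of[transcript_id]
        current = best.get(gene_id)
        if current is None or len(seq) > len(proteins[current]):
            best[gene_id] = transcript_id
    return {tid: proteins[tid] for tid in best.values()}


# Toy annotation with two isoforms for gene g1 and one for g2.
toy_proteins = {"g1.t1": "MAAA", "g1.t2": "MAAAAAA", "g2.t1": "MKK"}
toy_gene_of = {"g1.t1": "g1", "g1.t2": "g1", "g2.t1": "g2"}
print(sorted(longest_isoform_per_gene(toy_proteins, toy_gene_of)))
```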
The Hi-C reads were aligned to the genome assembly using Juicer v1.5.6 10 with the default parameters. The Juicebox 12 tools pre command (pre -n -q 0 or 1) was used to convert the raw file generated by Juicer into hic format, and the dump command (dump observed BP 100000) was used to extract a 100-kb contact matrix from the hic file. The hic file was visualized using Juicebox. Strong interaction signals were observed around the diagonal of the pseudo-chromosomes, and there was no obvious noise away from the diagonal (Fig. 5a), indicating the high quality of this chromosome assembly. In addition, no anomalies were observed across each homologous chromosome pair when duplicated reads were excluded (Fig. 5b), suggesting no switch errors between phased haplotypes.
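To make the 100-kb contact matrix concrete, the sketch below converts sparse dump output into a dense symmetric matrix and measures how much signal lies near the diagonal. It assumes the dump file holds three whitespace-separated columns (start of bin 1, start of bin 2, count) for a single chromosome. The function names and the near-diagonal summary are hypothetical additions for illustration and are not part of the authors' validation.

```python
def dense_contact_matrix(dump_lines, chrom_length, resolution=100_000):
    """Build a dense symmetric matrix from sparse 'pos1 pos2 count' lines.

    The three-column layout is an assumption about the dump output;
    lines with any other number of fields are skipped.
    """
    n_bins = -(-chrom_length // resolution)  # ceiling division
    matrix = [[0.0] * n_bins for _ in range(n_bins)]
    for line in dump_lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        i = int(fields[0]) // resolution
        j = int(fields[1]) // resolution
        count = float(fields[2])
        matrix[i][j] = count
        matrix[j][i] = count  # mirror the upper triangle
    return matrix


def diagonal_fraction(matrix, max_offset=1):
    """Fraction of total signal within `max_offset` bins of the diagonal."""
    total = sum(sum(row) for row in matrix)
    near = sum(
        value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if abs(i - j) <= max_offset
    )
    return near / total if total else 0.0


# Toy sparse contacts for a 250-kb sequence (illustrative only).
toy_dump = [
    "0\t0\t10",
    "0\t100000\t4",
    "0\t200000\t1",
    "100000\t100000\t8",
    "200000\t200000\t6",
]
toy_matrix = dense_contact_matrix(toy_dump, chrom_length=250_000)
print(round(diagonal_fraction(toy_matrix), 3))
```

A well-ordered pseudo-chromosome is expected to concentrate most contacts near the diagonal, which is the pattern described for Fig. 5a.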

Fig. 1 Morphological characters (a) and the landscape of genome assembly and annotation of M. paniculata (b). The tracks from outside to inside are: pseudo-chromosomes, density of class I TEs, density of class II TEs, density of protein-coding genes, proportion of tandem repeats, GC content and collinear blocks.

Fig. 3 Telomere distribution (a) and comparison of genome structure between haplotype A and haplotype B (b).

Fig. 5 Hi-C interaction heatmap of haplotype A and haplotype B with reads mapping quality ≥0 (including duplicated reads) (a) and mapping quality ≥1 (excluding duplicated reads) (b). The colour bar indicates the strength of the interaction, with yellow representing low and red representing high.

Table 2. Summary of repeat elements.

Table 3. Summary of M. paniculata genome annotations.

Table 4. Functional annotation of protein-coding genes in M. paniculata.

Table 5. Summary of mapping rates.

Table 6. Evaluation of M. paniculata genome assembly.

Table 7. BUSCO evaluation of M. paniculata genome annotation.